# Multimodal Question Answering

## docscopeOCR-7B-050425-exp
- **Author:** prithivMLmods
- **License:** Apache-2.0
- **Tags:** Image-to-Text · Transformers · Multilingual

docscopeOCR-7B-050425-exp is fine-tuned from Qwen/Qwen2.5-VL-7B-Instruct and focuses on document-level OCR, long-context vision-language understanding, and accurate conversion of mathematical content in images to LaTeX.
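Since the model is fine-tuned from Qwen2.5-VL, it can presumably be loaded through the standard Qwen2.5-VL path in recent transformers releases (4.49+). The sketch below illustrates that pattern; the repo id `prithivMLmods/docscopeOCR-7B-050425-exp`, the image path, and the prompt are assumptions for illustration, not confirmed by this listing.

```python
# Minimal sketch: document OCR with a Qwen2.5-VL-based checkpoint.
# Assumes the checkpoint loads via the standard Qwen2.5-VL classes
# and that qwen_vl_utils is installed.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "prithivMLmods/docscopeOCR-7B-050425-exp"  # assumed repo id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "page.png"},  # placeholder document image
        {"type": "text", "text": "Transcribe this page; render math as LaTeX."},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=1024)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```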
## VideoChat-R1_7B_caption
- **Author:** OpenGVLab
- **License:** Apache-2.0
- **Tags:** Video-to-Text · Transformers · English

VideoChat-R1_7B_caption is a multimodal video-to-text generation model built on Qwen2-VL-7B-Instruct, focusing on video content understanding and caption generation.
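Because this model is built on Qwen2-VL-7B-Instruct, a reasonable assumption is that it follows Qwen2-VL's standard video chat interface; the sketch below shows that pattern, with the repo id, video path, and prompt as placeholders.

```python
# Sketch: video captioning via the standard Qwen2-VL interface,
# assuming the checkpoint is loadable with these classes.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "OpenGVLab/VideoChat-R1_7B_caption"  # assumed repo id
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "clip.mp4"},  # placeholder video file
        {"type": "text", "text": "Describe what happens in this video."},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```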
## ViCA-7B
- **Author:** nkkbr
- **License:** Apache-2.0
- **Tags:** Video-to-Text · Transformers · English

ViCA-7B is a vision-language model fine-tuned specifically for visuospatial reasoning in indoor video environments. Built on the LLaVA-Video-7B-Qwen2 architecture and trained on the ViCA-322K dataset, it emphasizes structured spatial annotation and instruction-based complex reasoning tasks.
## DeepSeer-R1-Vision-Distill-Qwen-1.5B (google/vit-base-patch16-224)
- **Author:** mehmetkeremturkcan
- **License:** Apache-2.0
- **Tags:** Image-to-Text · Transformers

DeepSeer is a vision-language model built on DeepSeek-R1 that supports chain-of-thought reasoning and is trained with dialogue templates adapted for vision models.
## VideoRefer-7B
- **Author:** DAMO-NLP-SG
- **License:** Apache-2.0
- **Tags:** Video-to-Text · Transformers · English

VideoRefer-7B is a multimodal large language model focused on video question answering, capable of understanding and analyzing spatiotemporal object relationships in videos.
## LLaVA-SpaceSGG
- **Author:** wumengyangok
- **License:** Apache-2.0
- **Tags:** Image-to-Text · English

LLaVA-SpaceSGG is a visual question-answering model based on LLaVA-v1.5-13B that focuses on scene graph generation: it understands image content and produces structured scene descriptions.
## LongVU-Qwen2-7B
- **Author:** Vision-CAIR
- **License:** Apache-2.0
- **Tags:** Video-to-Text

LongVU is a multimodal model based on Qwen2-7B that focuses on long-video language understanding and employs spatiotemporal adaptive compression.
## Table-LLaVA-v1.5-7B
- **Author:** SpursgoZmy
- **Tags:** Image-to-Text · Transformers · English

Table LLaVA 7B is an open-source multimodal chatbot designed for understanding diverse table images and performing a range of table-related tasks.
## Monkey-Chat
- **Author:** echo840
- **Tags:** Image-to-Text · Transformers

Monkey is a large multimodal model that excels at a variety of visual tasks by increasing input image resolution and improving text-labeling methods.
## InstructBLIP-Vicuna-13B
- **Author:** Salesforce
- **License:** Other
- **Tags:** Image-to-Text · Transformers · English

InstructBLIP is the vision-instruction-tuned version of BLIP-2; this variant builds on the Vicuna-13B language model and is designed for vision-language tasks.
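InstructBLIP has dedicated classes in transformers, so this checkpoint can be used directly as sketched below; the image file and question are placeholders for illustration.

```python
# Instruction-following image QA with InstructBLIP (Vicuna-13B variant),
# using the dedicated transformers classes.
import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

model_id = "Salesforce/instructblip-vicuna-13b"
processor = InstructBlipProcessor.from_pretrained(model_id)
model = InstructBlipForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("photo.jpg").convert("RGB")  # placeholder image
inputs = processor(images=image,
                   text="What is unusual about this image?",
                   return_tensors="pt").to(model.device, torch.float16)
out = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```

The same classes also serve the Flan-T5 variant listed next; only the checkpoint id changes.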
## InstructBLIP-Flan-T5-XXL
- **Author:** Salesforce
- **License:** MIT
- **Tags:** Image-to-Text · Transformers · English

InstructBLIP is the vision-instruction-tuned version of BLIP-2, capable of generating descriptions or answers from images and text instructions.
## VideoBLIP-Flan-T5-XL-Ego4D
- **Author:** kpyu
- **License:** MIT
- **Tags:** Video-to-Text · Transformers · English

VideoBLIP is an extension of BLIP-2 capable of processing video data, using Flan-T5-XL as the backbone language model.